import pandas as pd
import numpy as np
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt
from collections import Counter
from collections import defaultdict
import itertools
from wordcloud import WordCloud
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from textblob import TextBlob
df = pd.read_csv("North_America-United_States.csv")
df.head()
| | data_id | iso | event_id_cnty | event_id_no_cnty | event_date | year | time_precision | event_type | sub_event_type | actor1 | ... | location | latitude | longitude | geo_precision | source | source_scale | notes | fatalities | timestamp | iso3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9383068 | 840 | USA44621 | 44621 | 24 June 2022 | 2022 | 1 | Protests | Excessive force against protesters | Protesters (United States) | ... | Cedar Rapids | 41.9738 | -91.6768 | 1 | KCRG-TV9; Network Contagion Research Institute... | Local partner-National | On 24 June 2022, 'pro-choice' protesters gathe... | 0 | 1656458658 | USA |
| 1 | 9383099 | 840 | USA44775 | 44775 | 24 June 2022 | 2022 | 1 | Protests | Peaceful protest | Protesters (United States) | ... | Fayetteville | 36.0626 | -94.1574 | 1 | Southwest Times Record; 40/29 | Subnational | On 24 June 2022, well over 100 people includin... | 0 | 1656458658 | USA |
| 2 | 9383100 | 840 | USA44783 | 44783 | 24 June 2022 | 2022 | 1 | Protests | Peaceful protest | Protesters (United States) | ... | Eureka | 40.8040 | -124.1654 | 1 | Lost Coast Outpost; North Coast Journal | Subnational | On 24 June 2022, hundreds of people rallied ou... | 0 | 1656458658 | USA |
| 3 | 9383101 | 840 | USA44788 | 44788 | 24 June 2022 | 2022 | 1 | Protests | Peaceful protest | Protesters (United States) | ... | West Hollywood | 34.0900 | -118.3617 | 1 | CBS News; Liveuamap | Local partner-New media | On 24 June 2022, hundreds of people rallied in... | 0 | 1656458658 | USA |
| 4 | 9383102 | 840 | USA44793 | 44793 | 24 June 2022 | 2022 | 1 | Protests | Peaceful protest | Protesters (United States) | ... | Monterey | 36.6031 | -121.8936 | 1 | Monterey County Herald; KSBW8 | Subnational | On 24 June 2022, hundreds of people rallied at... | 0 | 1656458658 | USA |
5 rows × 31 columns
df.shape
(42253, 31)
df.nunique()
data_id             42253
iso                     1
event_id_cnty       42253
event_id_no_cnty    42253
event_date            905
year                    3
time_precision          3
event_type              2
sub_event_type          5
actor1                  6
assoc_actor_1        5428
inter1                  2
actor2                 26
assoc_actor_2         310
inter2                  8
interaction            14
region                  1
country                 1
admin1                 51
admin2               1224
admin3                  0
location             4207
latitude             5134
longitude            5195
geo_precision           3
source              14344
source_scale           20
notes               42250
fatalities              4
timestamp             374
iso3                    1
dtype: int64
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42253 entries, 0 to 42252
Data columns (total 31 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   data_id           42253 non-null  int64
 1   iso               42253 non-null  int64
 2   event_id_cnty     42253 non-null  object
 3   event_id_no_cnty  42253 non-null  int64
 4   event_date        42253 non-null  object
 5   year              42253 non-null  int64
 6   time_precision    42253 non-null  int64
 7   event_type        42253 non-null  object
 8   sub_event_type    42253 non-null  object
 9   actor1            42253 non-null  object
 10  assoc_actor_1     32392 non-null  object
 11  inter1            42253 non-null  int64
 12  actor2            3882 non-null   object
 13  assoc_actor_2     1152 non-null   object
 14  inter2            42253 non-null  int64
 15  interaction       42253 non-null  int64
 16  region            42253 non-null  object
 17  country           42253 non-null  object
 18  admin1            42253 non-null  object
 19  admin2            42186 non-null  object
 20  admin3            0 non-null      float64
 21  location          42253 non-null  object
 22  latitude          42253 non-null  float64
 23  longitude         42253 non-null  float64
 24  geo_precision     42253 non-null  int64
 25  source            42253 non-null  object
 26  source_scale      42253 non-null  object
 27  notes             42253 non-null  object
 28  fatalities        42253 non-null  int64
 29  timestamp         42253 non-null  int64
 30  iso3              42253 non-null  object
dtypes: float64(3), int64(11), object(17)
memory usage: 10.0+ MB
df.isnull().sum()
data_id                 0
iso                     0
event_id_cnty           0
event_id_no_cnty        0
event_date              0
year                    0
time_precision          0
event_type              0
sub_event_type          0
actor1                  0
assoc_actor_1        9861
inter1                  0
actor2              38371
assoc_actor_2       41101
inter2                  0
interaction             0
region                  0
country                 0
admin1                  0
admin2                 67
admin3              42253
location                0
latitude                0
longitude               0
geo_precision           0
source                  0
source_scale            0
notes                   0
fatalities              0
timestamp               0
iso3                    0
dtype: int64
Although assoc_actor_1, actor2 and assoc_actor_2 contain a considerable number of null values, they remain useful: later on they can reveal the cases in which two actors were involved and the locations where those events took place.
The admin3 column, by contrast, is entirely empty and of no use, so we start by dropping it.
df.drop(columns=["admin3"], inplace=True)
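The same clean-up can be done without naming the column by hand. The following is a minimal sketch on a toy frame (the `toy` data and the variable names are illustrative assumptions, not part of the original notebook) that finds columns in which every value is missing and drops them:

```python
import pandas as pd

# Toy frame standing in for the real data; "admin3" mimics the empty column.
toy = pd.DataFrame({
    "admin1": ["Iowa", "Arkansas", "California"],
    "admin2": ["Linn", None, "Humboldt"],
    "admin3": [None, None, None],
})

# Names of the columns in which every single value is missing.
empty_cols = [col for col in toy.columns if toy[col].isna().all()]

# dropna(axis=1, how="all") removes only columns that are entirely missing;
# "admin2" survives because it has just one missing value.
cleaned = toy.dropna(axis=1, how="all")

print(empty_cols)
print(cleaned.columns.tolist())
```

Listing `empty_cols` before dropping makes it easy to confirm that only the intended column is removed.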
From the overview we can see that event_date is stored as an object (string) rather than a date type, so we convert it to datetime.
df["event_date"] = pd.to_datetime(df["event_date"])
df.head()
| | data_id | iso | event_id_cnty | event_id_no_cnty | event_date | year | time_precision | event_type | sub_event_type | actor1 | ... | location | latitude | longitude | geo_precision | source | source_scale | notes | fatalities | timestamp | iso3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9383068 | 840 | USA44621 | 44621 | 2022-06-24 | 2022 | 1 | Protests | Excessive force against protesters | Protesters (United States) | ... | Cedar Rapids | 41.9738 | -91.6768 | 1 | KCRG-TV9; Network Contagion Research Institute... | Local partner-National | On 24 June 2022, 'pro-choice' protesters gathe... | 0 | 1656458658 | USA |
| 1 | 9383099 | 840 | USA44775 | 44775 | 2022-06-24 | 2022 | 1 | Protests | Peaceful protest | Protesters (United States) | ... | Fayetteville | 36.0626 | -94.1574 | 1 | Southwest Times Record; 40/29 | Subnational | On 24 June 2022, well over 100 people includin... | 0 | 1656458658 | USA |
| 2 | 9383100 | 840 | USA44783 | 44783 | 2022-06-24 | 2022 | 1 | Protests | Peaceful protest | Protesters (United States) | ... | Eureka | 40.8040 | -124.1654 | 1 | Lost Coast Outpost; North Coast Journal | Subnational | On 24 June 2022, hundreds of people rallied ou... | 0 | 1656458658 | USA |
| 3 | 9383101 | 840 | USA44788 | 44788 | 2022-06-24 | 2022 | 1 | Protests | Peaceful protest | Protesters (United States) | ... | West Hollywood | 34.0900 | -118.3617 | 1 | CBS News; Liveuamap | Local partner-New media | On 24 June 2022, hundreds of people rallied in... | 0 | 1656458658 | USA |
| 4 | 9383102 | 840 | USA44793 | 44793 | 2022-06-24 | 2022 | 1 | Protests | Peaceful protest | Protesters (United States) | ... | Monterey | 36.6031 | -121.8936 | 1 | Monterey County Herald; KSBW8 | Subnational | On 24 June 2022, hundreds of people rallied at... | 0 | 1656458658 | USA |
5 rows × 30 columns
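The conversion above lets pandas infer the date format. Since the raw values look like "24 June 2022", the format can also be stated explicitly, which makes parsing stricter. This is a small sketch on toy values; the format string `%d %B %Y` is an assumption based on the rows shown in the preview, and it expects English month names.

```python
import pandas as pd

# Toy values in the same style as the raw event_date column.
raw_dates = pd.Series(["24 June 2022", "1 January 2020"])

# %d = day of month, %B = full month name, %Y = four-digit year.
# With an explicit format, a value that does not match raises an error
# instead of being parsed by guesswork.
parsed = pd.to_datetime(raw_dates, format="%d %B %Y")

print(parsed.dt.year.tolist())
```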
df['year'] = df['event_date'].dt.year.astype(str)
yearly_counts = df['year'].value_counts().reset_index()
yearly_counts.columns = ['year', 'count']
fig = px.pie(yearly_counts, names="year", values="count", title="Number of Protests per Year")
fig.show()
df['year_month'] = df['event_date'].dt.strftime('%Y-%m')
monthly_protest_counts = df.groupby('year_month').size().reset_index(name='protest_count')
fig = px.bar(monthly_protest_counts, x='year_month', y='protest_count',
title='Monthly Protest Count Per Year',
labels={'year_month': 'Year-Month', 'protest_count': 'Number of Protests'})
fig.update_xaxes(tickangle=-45)
fig.show()
The previous plot showed that the majority of the data comes from 2020, and this bar chart corroborates that.
The five months with the highest protest counts across 2020-2022 all fall between May and September 2020.
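A claim about the busiest months can be checked directly from the monthly counts rather than read off the chart. The sketch below uses a handful of made-up dates (an assumption for illustration only) and the same year-month grouping as the notebook, then picks the largest groups with `nlargest`:

```python
import pandas as pd

# Toy event dates; the real notebook would use df["event_date"].
events = pd.DataFrame({
    "event_date": pd.to_datetime([
        "2020-06-01", "2020-06-15", "2020-06-20",
        "2020-08-03", "2020-08-09",
        "2021-01-05",
    ])
})
events["year_month"] = events["event_date"].dt.strftime("%Y-%m")

# Count events per month, then keep the two busiest months.
monthly = events.groupby("year_month").size()
top_months = monthly.nlargest(2)

print(top_months)
```

On the full data, `monthly.nlargest(5)` would list the five busiest months.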
event_counts = df['event_type'].value_counts().reset_index()
event_counts.columns = ['event_type', 'count']
fig = px.pie(event_counts, names="event_type", values="count", title="Count of Event types")
fig.show()
sub_event_counts = df['sub_event_type'].value_counts().reset_index()
sub_event_counts.columns = ['sub_event_type', 'count']
fig = px.pie(sub_event_counts, names="sub_event_type", values="count",hole = 0.3, title="Count of Sub Event types")
fig.update_xaxes(tickangle=-90)
fig.show()
In line with the previous graph, most of the sub-events are peaceful protests, which also explains why protests make up the majority of event types.
actor1_counts = df['actor1'].value_counts().reset_index()
actor1_counts.columns = ['actor1', 'count']
fig = px.bar(actor1_counts, x="actor1", y="count", title="Count of Actor1")
fig.update_xaxes(tickangle=-45)
fig.show()
assoc_actors = df['assoc_actor_1'].dropna().tolist()
assoc_actors_cleaned = []
for actor in assoc_actors:
    # An entry may list several actors separated by semicolons
    if ';' in actor:
        assoc_actors_cleaned.extend(actor.split(';'))
    else:
        assoc_actors_cleaned.append(actor)
assoc_actors_cleaned = [actor.strip() for actor in assoc_actors_cleaned]
assoc_actor_counts = Counter(assoc_actors_cleaned)
sorted_actors = sorted(assoc_actor_counts, key=assoc_actor_counts.get, reverse=True)
top_50_actors = sorted_actors[:50]
top_50_df = pd.DataFrame.from_dict({actor: assoc_actor_counts[actor] for actor in top_50_actors}, orient='index', columns=['Count']).reset_index()
top_50_df.rename(columns={'index': 'Actor'}, inplace=True)
#fig.update_xaxes(tickangle=45)
fig = px.bar(top_50_df, x='Actor', y='Count', title='Top 50 Associated Actors')
fig.show()
(BLM: Black Lives Matter, Students, Labour Group, Women, Government of the United States, African American Group)
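The split-and-count loop used for the associated actors can also be written as one pandas chain. This is a sketch on a toy Series (the values are illustrative assumptions), not a replacement the notebook requires; it should produce the same kind of counts as the `Counter` approach.

```python
import pandas as pd

# Toy column in the style of assoc_actor_1: semicolon-separated names, some missing.
assoc = pd.Series([
    "BLM: Black Lives Matter; Students (United States)",
    "Students (United States)",
    None,
    "BLM: Black Lives Matter",
    "BLM: Black Lives Matter; Labour Group (United States)",
])

actor_counts = (
    assoc.dropna()          # ignore rows without an associated actor
         .str.split(";")    # one list of names per row
         .explode()         # one row per individual name
         .str.strip()       # remove the space left after each semicolon
         .value_counts()    # counts, sorted from most to least frequent
)

print(actor_counts.head(50))
```

`head(50)` on the result plays the role of the top-50 slice in the loop-based version.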
actor2_counts = df['actor2'].value_counts().reset_index()
actor2_counts.columns = ['actor2', 'count']
fig = px.bar(actor2_counts, x="actor2", y="count", title="Count of Actor2")
fig.update_xaxes(tickangle=-45)
fig.show()
assoc_actors_2 = df['assoc_actor_2'].dropna().tolist()
assoc_actors_2_cleaned = []
for actor in assoc_actors_2:
    # An entry may list several actors separated by semicolons
    if ';' in actor:
        assoc_actors_2_cleaned.extend(actor.split(';'))
    else:
        assoc_actors_2_cleaned.append(actor)
assoc_actors_2_cleaned = [actor.strip() for actor in assoc_actors_2_cleaned]
assoc_actor2_counts = Counter(assoc_actors_2_cleaned)
sorted_actors_2 = sorted(assoc_actor2_counts, key=assoc_actor2_counts.get, reverse=True)
top_50_actors = sorted_actors_2[:50]
top_50_df = pd.DataFrame.from_dict({actor: assoc_actor2_counts[actor] for actor in top_50_actors}, orient='index', columns=['Count']).reset_index()
top_50_df.rename(columns={'index': 'Actor'}, inplace=True)
fig = px.bar(top_50_df, x='Actor', y='Count', title='Top 50 Associated Actors 2')
fig.update_xaxes(tickangle=-45)
fig.show()
By analyzing Actor1 and Assoc Actor 1 against Actor 2 and Assoc Actor 2, we can try to draw out some interesting insights into how they relate.
interaction_counts = df['interaction'].value_counts().reset_index()
interaction_counts.columns = ['interaction', 'count']
fig = px.pie(interaction_counts, names="interaction", values="count",hole = 0.3, title="Count of Protests per interaction")
fig.update_xaxes(tickangle=-90)
fig.show()
admin1_counts = df['admin1'].value_counts().reset_index()
admin1_counts.columns = ['admin1', 'count']
fig = px.bar(admin1_counts, x="admin1", y="count", title="Count of Admin1")
fig.update_xaxes(tickangle=-45)
fig.show()
Here we can see that the leading states are those with large, populous cities. Because most of the population centers and much of the modern working public of the United States are located there, it makes sense that the top 10 would roughly match the most populous and prominent states in the United States.
It might be interesting to examine how sub_event_type relates to the state.
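One way to look at the relation between state and sub-event type is a cross-tabulation. The sketch below runs on a toy frame (the rows are made up for illustration; the column names and sub-event labels follow the dataset preview) and shows both raw counts and per-state shares:

```python
import pandas as pd

# Toy rows using the dataset's admin1 and sub_event_type columns.
toy = pd.DataFrame({
    "admin1": ["California", "California", "New York", "New York", "New York"],
    "sub_event_type": [
        "Peaceful protest",
        "Excessive force against protesters",
        "Peaceful protest",
        "Peaceful protest",
        "Peaceful protest",
    ],
})

# Raw counts: one row per state, one column per sub-event type.
counts = pd.crosstab(toy["admin1"], toy["sub_event_type"])

# Row-normalized shares, so states with very different totals can be compared.
shares = pd.crosstab(toy["admin1"], toy["sub_event_type"], normalize="index")

print(counts)
print(shares.round(2))
```

Normalizing by row matters here because the most populous states would otherwise dominate every column simply by having more events.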
admin2_counts = df['admin2'].value_counts().reset_index()
admin2_counts.columns = ['admin2', 'count']
top50_admin2_counts = admin2_counts.head(50)
fig = px.bar(top50_admin2_counts, x="admin2", y="count", title="Top 50 Count of Admin2")
fig.update_xaxes(tickangle=-45)
fig.show()
location_counts = df['location'].value_counts().reset_index()
location_counts.columns = ['location', 'count']
top50_location_counts = location_counts.head(50)
fig = px.bar(top50_location_counts, x="location", y="count", title="Top 50 Count of location")
fig.update_xaxes(tickangle=-45)
fig.show()
sources = df['source'].dropna().tolist()
sources_cleaned = []
for actor in sources:
    # A row may cite several sources separated by semicolons
    if ';' in actor:
        sources_cleaned.extend(actor.split(';'))
    else:
        sources_cleaned.append(actor)
sources_cleaned = [actor.strip() for actor in sources_cleaned]
sources_counts = Counter(sources_cleaned)
sorted_sources = sorted(sources_counts, key=sources_counts.get, reverse=True)
top_50_actors = sorted_sources[:50]
top_50_df = pd.DataFrame.from_dict({actor: sources_counts[actor] for actor in top_50_actors}, orient='index', columns=['Count']).reset_index()
top_50_df.rename(columns={'index': 'Sources'}, inplace=True)
fig = px.bar(top_50_df, x='Sources', y='Count', title='Top 50 Associated Sources')
fig.update_xaxes(tickangle=-45)
fig.show()
The majority of the information is sourced from three sources:
Not much of the data has been gathered from traditional media channels. This may say something about the quality of the data, and about the non-legacy channels through which data can be accessed in this day and age.
source_scale_counts = df['source_scale'].value_counts().reset_index()
source_scale_counts.columns = ['source_scale', 'count']
top50_source_scale_counts = source_scale_counts.head(50)
fig = px.bar(top50_source_scale_counts, x="source_scale", y="count", title="Top 50 Count of source_scale")
fig.update_xaxes(tickangle=-45)
fig.show()
Based on the univariate analysis, we focus on events whose primary actor (actor1) is a United States actor, since these make up the vast majority of the rows (42,238 of 42,253).
df = df[df['actor1'].str.contains('United States', case=False)]
df.shape
(42238, 31)
df.describe()
| | data_id | iso | event_id_no_cnty | event_date | time_precision | inter1 | inter2 | interaction | latitude | longitude | geo_precision | fatalities | timestamp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4.223800e+04 | 42238.0 | 42238.000000 | 42238 | 42238.000000 | 42238.000000 | 42238.000000 | 42238.000000 | 42238.000000 | 42238.000000 | 42238.000000 | 42238.000000 | 4.223800e+04 |
| mean | 8.266746e+06 | 840.0 | 22430.918959 | 2021-02-01 18:58:12.779014144 | 1.005422 | 5.974620 | 0.330177 | 58.229627 | 38.370137 | -92.543588 | 1.007955 | 0.000710 | 1.627587e+09 |
| min | 7.593655e+06 | 840.0 | 1.000000 | 2020-01-01 00:00:00 | 1.000000 | 5.000000 | 0.000000 | 15.000000 | 19.442700 | -166.534100 | 1.000000 | 0.000000 | 1.612546e+09 |
| 25% | 7.616716e+06 | 840.0 | 11024.250000 | 2020-06-20 00:00:00 | 1.000000 | 6.000000 | 0.000000 | 60.000000 | 34.799800 | -106.651100 | 1.000000 | 0.000000 | 1.612547e+09 |
| 50% | 8.265290e+06 | 840.0 | 22444.500000 | 2020-11-17 00:00:00 | 1.000000 | 6.000000 | 0.000000 | 60.000000 | 39.219100 | -86.443600 | 1.000000 | 0.000000 | 1.624483e+09 |
| 75% | 8.680164e+06 | 840.0 | 33741.750000 | 2021-08-18 00:00:00 | 1.000000 | 6.000000 | 0.000000 | 60.000000 | 41.850000 | -77.615500 | 1.000000 | 0.000000 | 1.638920e+09 |
| max | 9.383708e+06 | 840.0 | 45052.000000 | 2022-06-24 00:00:00 | 3.000000 | 6.000000 | 8.000000 | 68.000000 | 71.290600 | -67.461400 | 3.000000 | 3.000000 | 1.656459e+09 |
| std | 5.543203e+05 | 0.0 | 13078.246226 | NaN | 0.074075 | 0.157278 | 1.280650 | 9.161994 | 5.195800 | 18.214649 | 0.097962 | 0.030766 | 1.388826e+07 |
Dropped the admin3 column and converted the appropriate data types as seen above
pair_freq = defaultdict(int)
for _, row in df.dropna(subset=['assoc_actor_1', 'assoc_actor_2']).iterrows():
    # Each side may list several actors separated by semicolons
    actors_1 = [actor.strip() for actor in row['assoc_actor_1'].split(';')]
    actors_2 = [actor.strip() for actor in row['assoc_actor_2'].split(';')]
    # Count every cross-side combination of associated actors
    for actor_1, actor_2 in itertools.product(actors_1, actors_2):
        pair_freq[(actor_1, actor_2)] += 1
pair_freq_df = pd.DataFrame(list(pair_freq.items()), columns=['Actor Pair', 'Frequency'])
pair_freq_df_sorted = pair_freq_df.sort_values(by='Frequency', ascending=False).reset_index(drop=True)
pair_freq_df_sorted.head(10)
| | Actor Pair | Frequency |
|---|---|---|
| 0 | (BLM: Black Lives Matter, Pro-Police Group (Un... | 173 |
| 1 | (Pro-Police Group (United States), BLM: Black ... | 105 |
| 2 | (Back the Blue, BLM: Black Lives Matter) | 60 |
| 3 | (BLM: Black Lives Matter, Back the Blue) | 53 |
| 4 | (BLM: Black Lives Matter, Civilians (United St... | 38 |
| 5 | (Proud Boys, Antifa (United States)) | 36 |
| 6 | (Students (United States), Students (United St... | 35 |
| 7 | (Women (United States), Students (United States)) | 31 |
| 8 | (Women (United States), Christian Group (Unite... | 30 |
| 9 | (BLM: Black Lives Matter, Journalists (United ... | 30 |
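In the table above the same two groups can appear twice, once in each order, because the pair keeps track of which group was on the assoc_actor_1 side and which on the assoc_actor_2 side. If that distinction is not needed, the two orders can be merged. The following is a sketch of that optional step, using three of the frequencies from the table; sorting the names alphabetically to build the key is a choice made for this example.

```python
from collections import Counter

# Three ordered pairs with the frequencies reported in the table above.
pair_freq = {
    ("Back the Blue", "BLM: Black Lives Matter"): 60,
    ("BLM: Black Lives Matter", "Back the Blue"): 53,
    ("Proud Boys", "Antifa (United States)"): 36,
}

# Use the sorted pair as the key so that (A, B) and (B, A) are counted together.
unordered_freq = Counter()
for (actor_1, actor_2), freq in pair_freq.items():
    key = tuple(sorted((actor_1, actor_2)))
    unordered_freq[key] += freq

for pair, freq in unordered_freq.most_common():
    print(pair, freq)
```

Merging discards the information about which side each group was recorded on, so it is only appropriate when the question is simply which groups appear together.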
pair_freq_df_sorted['Actor Pair Str'] = pair_freq_df_sorted['Actor Pair'].apply(lambda x: f"{x[0]} & {x[1]}")
# The values are stored as tuples, so the lambda function above joins the two
# members of each pair with an & for readability.
plt.figure(figsize=(10, 8))
sns.barplot(x='Frequency', y='Actor Pair Str', data=pair_freq_df_sorted.head(20))
plt.title('Top 20 Most Frequent Actor Pairs')
plt.xlabel('Frequency')
plt.ylabel('Actor Pairs')
plt.tight_layout()
plt.show()
df['interaction'] = df['interaction'].astype(str)
interaction_counts = df['interaction'].value_counts().reset_index()
interaction_counts.columns = ['interaction', 'count']
interaction_counts = interaction_counts[1:]
fig = px.pie(interaction_counts, names="interaction", values="count", title="Count of protests per interaction code")
fig.show()
blm_data = df[df['assoc_actor_1'].str.contains('BLM: Black Lives Matter', na=False)].copy()
# Isolating all the rows that mention BLM in the associated actor;
# .copy() allows adding columns later without a SettingWithCopyWarning
yearly_sub_event_counts = blm_data.groupby([blm_data['event_date'].dt.year, 'sub_event_type']).size().unstack(fill_value=0)
sns.heatmap(yearly_sub_event_counts, cmap='viridis', annot=True, fmt='d')
plt.title('Yearly Count of "BLM: Black Lives Matter" by Sub-event Type')
plt.xlabel('Sub-event Type')
plt.ylabel('Year')
plt.show()
We can see that the majority of events associated with BLM have been peaceful protests, although not all of them. There are also a not insignificant number of instances in which force was used.
The impact of the movement in 2020 was very high, as the heatmap shows, but the mobilization was not sustained: the number of events associated with the movement decreased over the next two years.
BLM is the largest contributor to the events in 2020, which explains the large spike observed earlier.
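The size of a group's contribution in each year can be expressed as a share of all events in that year. The sketch below uses toy rows (made up for illustration) with the dataset's column names; the mean of a boolean mask within each year is the fraction of that year's events that mention the group.

```python
import pandas as pd

# Toy rows with the dataset's year and assoc_actor_1 columns.
toy = pd.DataFrame({
    "year": [2020, 2020, 2020, 2021, 2021, 2022],
    "assoc_actor_1": [
        "BLM: Black Lives Matter",
        "BLM: Black Lives Matter; Students (United States)",
        "Women (United States)",
        "BLM: Black Lives Matter",
        None,
        "Labour Group (United States)",
    ],
})

# True where the associated actors mention BLM; missing values count as False.
is_blm = toy["assoc_actor_1"].str.contains(
    "BLM: Black Lives Matter", na=False, regex=False
)

# Averaging True/False within each year gives the BLM share of that year's events.
blm_share = is_blm.groupby(toy["year"]).mean()

print(blm_share.round(2))
```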
blm_data.columns
Index(['data_id', 'iso', 'event_id_cnty', 'event_id_no_cnty', 'event_date',
'year', 'time_precision', 'event_type', 'sub_event_type', 'actor1',
'assoc_actor_1', 'inter1', 'actor2', 'assoc_actor_2', 'inter2',
'interaction', 'region', 'country', 'admin1', 'admin2', 'location',
'latitude', 'longitude', 'geo_precision', 'source', 'source_scale',
'notes', 'fatalities', 'timestamp', 'iso3', 'year_month'],
dtype='object')
blm_data['year_month'] = blm_data['event_date'].dt.strftime('%Y-%m')
#creating a year-month for ease of analysis from the converted event_date
monthly_protest_counts = blm_data.groupby(['year_month','admin1']).size().reset_index(name='protest_count')
#per month protest counts coloured by admin1 to check for a possible correlation
fig = px.bar(monthly_protest_counts, x='admin1', y='protest_count', color = 'year_month',
title='BLM State Wise Protest Count Per Year_Month',
labels={'year_month': 'Month', 'protest_count': 'Number of Protests'})
fig.update_xaxes(tickangle=-45)
fig.show()
def analyze_sentiment(text):
return TextBlob(text).sentiment.polarity
blm_data = df[df['assoc_actor_1'].str.contains('BLM: Black Lives Matter', na=False) & df['notes'].notna()].copy()
blm_data['sentiment'] = blm_data['notes'].apply(analyze_sentiment)
wordcloud_text = ' '.join(blm_data['notes'])
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(wordcloud_text)
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud for "BLM: Black Lives Matter" in Assoc_actor_1')
plt.show()
plt.figure(figsize=(8, 5))
plt.hist(blm_data['sentiment'], bins=20, color='skyblue')
plt.title('Sentiment Distribution for "BLM: Black Lives Matter" in Assoc_actor_1')
plt.xlabel('Sentiment Polarity')
plt.ylabel('Frequency')
plt.show()
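TextBlob polarity scores lie between -1 (most negative) and 1 (most positive), so the histogram can be complemented by a simple three-way grouping. This is a sketch on made-up scores; the cut-offs at -0.1 and 0.1 are an arbitrary assumption chosen for illustration, not a TextBlob convention.

```python
import pandas as pd

# Made-up polarity scores standing in for blm_data["sentiment"].
polarity = pd.Series([-0.6, -0.05, 0.0, 0.04, 0.3, 0.8])

# Bin edges: [-1, -0.1] negative, (-0.1, 0.1] neutral, (0.1, 1] positive.
# The +/-0.1 thresholds are an assumption for this example.
sentiment_label = pd.cut(
    polarity,
    bins=[-1.0, -0.1, 0.1, 1.0],
    labels=["negative", "neutral", "positive"],
    include_lowest=True,
)

label_counts = sentiment_label.value_counts()
print(label_counts)
```

Because the notes are written as factual event descriptions, a large neutral group would not be surprising.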
A simple word cloud shows, in accordance with the 2020 data, that most of the protests and riots revolved predominantly around Black Lives Matter.
blm_source_counts = blm_data['source'].value_counts() #Getting the counts of the sources
blm_source_10 = blm_source_counts[:10] #As there are many sources, looking here at the top 10
plt.figure(figsize=(10, 5))
blm_source_10.plot(kind='bar', color='skyblue')
plt.title('Count of "BLM: Black Lives Matter" by Source')
plt.xlabel('Source')
plt.ylabel('Count')
plt.xticks(rotation=90)
plt.show()
fig = px.scatter_mapbox(blm_data, lat="latitude", lon="longitude", hover_name="location", zoom=3, height=500)
fig.update_layout(mapbox_style="open-street-map")
fig.update_layout(title='Map of "BLM: Black Lives Matter" Instances', margin={"r":0,"t":30,"l":0,"b":0})
fig.show()
students_data = df[df['assoc_actor_1'].str.contains('Students', na=False)].copy()
yearly_sub_event_counts = students_data.groupby([students_data['event_date'].dt.year, 'sub_event_type']).size().unstack(fill_value=0)
plt.figure(figsize=(12, 6))
sns.heatmap(yearly_sub_event_counts, cmap='viridis', annot=True, fmt='d')
plt.title('Yearly Count of "Students (United States)" by Sub-event Type')
plt.xlabel('Sub-event Type')
plt.ylabel('Year')
plt.show()
students_data.columns
Index(['data_id', 'iso', 'event_id_cnty', 'event_id_no_cnty', 'event_date',
'year', 'time_precision', 'event_type', 'sub_event_type', 'actor1',
'assoc_actor_1', 'inter1', 'actor2', 'assoc_actor_2', 'inter2',
'interaction', 'region', 'country', 'admin1', 'admin2', 'location',
'latitude', 'longitude', 'geo_precision', 'source', 'source_scale',
'notes', 'fatalities', 'timestamp', 'iso3', 'year_month'],
dtype='object')
students_data['month'] = students_data['event_date'].dt.strftime('%m')
monthly_protest_counts = students_data.groupby(['month','admin1']).size().reset_index(name='protest_count')
fig = px.bar(monthly_protest_counts, x='admin1', y='protest_count', color = 'month',
             title='Students Protest Count by State and Month',
             labels={'month': 'Month', 'protest_count': 'Number of Protests'})
fig.update_xaxes(tickangle=-45)
fig.show()
students_notes = ' '.join(df[df['assoc_actor_1'].str.contains('Students', na=False)]['notes'].dropna())
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(students_notes)
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud for "Students" in Assoc_actor_1')
plt.show()
students_source_counts = students_data['source'].value_counts()
students_source_10 = students_source_counts[:10]
plt.figure(figsize=(10, 5))
students_source_10.plot(kind='bar', color='skyblue')
plt.title('Count of "students" by Source ')
plt.xlabel('Source')
plt.ylabel('Count')
plt.xticks(rotation=90)
plt.show()
fig = px.scatter_mapbox(students_data, lat="latitude", lon="longitude", hover_name="location", zoom=3, height=500)
fig.update_layout(mapbox_style="open-street-map")
fig.update_layout(title='Map of "students" Instances', margin={"r":0,"t":30,"l":0,"b":0})
fig.show()
labour_group_data = df[df['assoc_actor_1'].str.contains('Labour Group', na=False)].copy()
yearly_sub_event_counts = labour_group_data.groupby([labour_group_data['event_date'].dt.year, 'sub_event_type']).size().unstack(fill_value=0)
plt.figure(figsize=(12, 6))
sns.heatmap(yearly_sub_event_counts, cmap='viridis', annot=True, fmt='d')
plt.title('Yearly Count of "labour_group (United States)" by Sub-event Type')
plt.xlabel('Sub-event Type')
plt.ylabel('Year')
plt.show()
labour_group_data['month'] = labour_group_data['event_date'].dt.strftime('%m')
monthly_protest_counts = labour_group_data.groupby(['month','admin1']).size().reset_index(name='protest_count')
fig = px.bar(monthly_protest_counts, x='admin1', y='protest_count', color = 'month',
             title='Labour Group Protest Count by State and Month',
             labels={'month': 'Month', 'protest_count': 'Number of Protests'})
fig.update_xaxes(tickangle=-45)
fig.show()
labour_group_notes = ' '.join(df[df['assoc_actor_1'].str.contains('Labour Group', na=False)]['notes'].dropna())
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(labour_group_notes)
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud for "labour_group" in Assoc_actor_1')
plt.show()
labour_group_source_counts = labour_group_data['source'].value_counts()
labour_group_source_10 = labour_group_source_counts[:10]
plt.figure(figsize=(10, 5))
labour_group_source_10.plot(kind='bar', color='skyblue')
plt.title('Count of "labour_group" by Source')
plt.xlabel('Source')
plt.ylabel('Count')
plt.xticks(rotation=90)
plt.show()
fig = px.scatter_mapbox(labour_group_data, lat="latitude", lon="longitude", hover_name="location", zoom=3, height=500)
fig.update_layout(mapbox_style="open-street-map")
fig.update_layout(title='Map of "Labour Group" Instances', margin={"r":0,"t":30,"l":0,"b":0})
fig.show()
women_data = df[df['assoc_actor_1'].str.contains('Women', na=False)].copy()
yearly_sub_event_counts = women_data.groupby([women_data['event_date'].dt.year, 'sub_event_type']).size().unstack(fill_value=0)
plt.figure(figsize=(12, 6))
sns.heatmap(yearly_sub_event_counts, cmap='viridis', annot=True, fmt='d')
plt.title('Yearly Count of "Women (United States)" by Sub-event Type')
plt.xlabel('Sub-event Type')
plt.ylabel('Year')
plt.show()
women_data['month'] = women_data['event_date'].dt.strftime('%m')
monthly_protest_counts = women_data.groupby(['month','admin1']).size().reset_index(name='protest_count')
fig = px.bar(monthly_protest_counts, x='admin1', y='protest_count', color = 'month',
             title='Women Protest Count by State and Month',
             labels={'month': 'Month', 'protest_count': 'Number of Protests'})
fig.update_xaxes(tickangle=-45)
fig.show()
women_notes = ' '.join(df[df['assoc_actor_1'].str.contains('Women', na=False)]['notes'].dropna())
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(women_notes)
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud for "women" in Assoc_actor_1')
plt.show()
women_source_counts = women_data['source'].value_counts()
women_source_count_10 = women_source_counts[:10]
plt.figure(figsize=(10, 5))
women_source_count_10.plot(kind='bar', color='skyblue')
plt.title('Count of "women" by Source')
plt.xlabel('Source')
plt.ylabel('Count')
plt.xticks(rotation=90)
plt.show()
fig = px.scatter_mapbox(women_data, lat="latitude", lon="longitude", hover_name="location", zoom=3, height=500)
fig.update_layout(mapbox_style="open-street-map")
fig.update_layout(title='Map of "Women" Instances', margin={"r":0,"t":30,"l":0,"b":0})
fig.show()
govt_data = df[df['assoc_actor_1'].str.contains('Government of the United States', na=False)].copy()
yearly_sub_event_counts = govt_data.groupby([govt_data['event_date'].dt.year, 'sub_event_type']).size().unstack(fill_value=0)
plt.figure(figsize=(12, 6))
sns.heatmap(yearly_sub_event_counts, cmap='viridis', annot=True, fmt='d')
plt.title('Yearly Count of "Govt (United States)" by Sub-event Type')
plt.xlabel('Sub-event Type')
plt.ylabel('Year')
plt.show()
govt_data['month'] = govt_data['event_date'].dt.strftime('%m')
monthly_protest_counts = govt_data.groupby(['month','admin1']).size().reset_index(name='protest_count')
fig = px.bar(monthly_protest_counts, x='admin1', y='protest_count', color = 'month',
             title='Govt Protest Count by State and Month',
             labels={'month': 'Month', 'protest_count': 'Number of Protests'})
fig.update_xaxes(tickangle=-45)
fig.show()
govt_notes = ' '.join(df[df['assoc_actor_1'].str.contains('Government of the United States', na=False)]['notes'].dropna())
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(govt_notes)
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud for "govt" in Assoc_actor_1')
plt.show()
govt_source_counts = govt_data['source'].value_counts()
govt_source_count_10 = govt_source_counts[:10]
plt.figure(figsize=(10, 5))
govt_source_count_10.plot(kind='bar', color='skyblue')
plt.title('Count of "govt" by Source')
plt.xlabel('Source')
plt.ylabel('Count')
plt.xticks(rotation=90)
plt.show()
fig = px.scatter_mapbox(govt_data, lat="latitude", lon="longitude", hover_name="location", zoom=3, height=500)
fig.update_layout(mapbox_style="open-street-map")
fig.update_layout(title='Map of "Government of the United States" Instances', margin={"r":0,"t":30,"l":0,"b":0})
fig.show()
The map shows high-density clusters of protests in major urban areas such as the East and West Coasts, particularly around cities like Los Angeles, New York, and Washington D.C. This is consistent with population density and the presence of institutional structures that are often the focus of demonstrations.
blm_data['assoc_actor'] = 'BLM'
students_data['assoc_actor'] = 'Students'
labour_group_data['assoc_actor'] = 'Labour Group'
women_data['assoc_actor'] = 'Women'
govt_data['assoc_actor'] = 'Government'
# Combining the data behind the previous maps to visualize them at a single glance
# and to understand and discern the differences between the top 5 associated actors
combined_data = pd.concat([blm_data, students_data, labour_group_data, women_data, govt_data])
fig = px.scatter_mapbox(combined_data, lat="latitude", lon="longitude",
                        color="assoc_actor", hover_name="location",
                        zoom=3, height=500, title='Combined Map of Various Associated Actors')
fig.update_layout(mapbox_style="open-street-map",
                  mapbox_zoom=3,
                  mapbox_center={"lat": 37.0902, "lon": -95.7129},
                  margin={"r":0,"t":30,"l":0,"b":0},  # leave room at the top so the title stays visible
                  legend_title_text='assoc_actor')
fig.show()
df_filtered = df[df['event_date'].dt.year == 2020].copy()  # keep only 2020; .copy() avoids SettingWithCopyWarning
df_filtered['year_month'] = df_filtered['event_date'].dt.strftime('%Y-%m')
# Build a year-month label from the datetime-converted event_date
monthly_protest_counts = df_filtered.groupby(['year_month', 'event_type']).size().to_frame(name='protest_count').reset_index()
# Group by year_month and event_type to get monthly event counts, coloured by event_type in the chart
fig = px.bar(monthly_protest_counts, x='year_month', y='protest_count', color = 'event_type',
title='Monthly Protest Count in 2020',
labels={'year_month': 'Year-Month', 'protest_count': 'Number of Protests'})
fig.update_xaxes(tickangle=-45)
fig.show()
The top 3 months are May > August > July. Let's explore them further.
df_filtered_riot = df_filtered[df_filtered['event_type'] == 'Riots']
df_filtered_protests = df_filtered[df_filtered['event_type'] == 'Protests']
df_filtered_riot_day = df_filtered_riot.groupby('event_date')['event_type'].count().reset_index(name='riot_count')
fig = px.bar(df_filtered_riot_day, x='event_date', y='riot_count', title='Daily Riot Count in 2020')
#Daily Riot count for the year 2020
fig.show()
df_filtered_protests_day = df_filtered_protests.groupby('event_date')['event_type'].count().reset_index(name='protest_count')
fig = px.bar(df_filtered_protests_day, x='event_date', y='protest_count', title='Daily Protest Count in 2020')
#Daily protest count for the year 2020
fig.show()
df = pd.read_csv("North_America-United_States.csv")
df["event_date"] = pd.to_datetime(df["event_date"])
filtered_df = df[(df['year'] == 2020) &
(df['event_type'] == 'Riots')]
#Selecting a subset of only the Riots in 2020
fig = make_subplots(rows=2, cols=2, specs=[[{'type': 'domain'}, {'type': 'domain'}],
[{'type': 'domain'}, {'type': 'domain'}]],
subplot_titles=('May 2020', 'June 2020', 'July 2020', 'August 2020'))
months = ['May', 'June', 'July', 'August']
subplot_positions = [(1, 1), (1, 2), (2, 1), (2, 2)]
for month, position in zip(months, subplot_positions):
    month_df = filtered_df[filtered_df['event_date'].dt.strftime('%B') == month]
    # Many rows list several associated actors separated by ';'. Split them, give each
    # actor its own row, and trim stray whitespace so the value counts are accurate.
    assoc_actor_counts = month_df['assoc_actor_1'].str.split(';').explode().str.strip().value_counts()
    top_contributors = assoc_actor_counts.head(5)
    labels = top_contributors.index
    values = top_contributors.values
    fig.add_trace(go.Pie(labels=labels, values=values, name=month), row=position[0], col=position[1])
fig.update_layout(
title=dict(text='Top 5 Contributors in Associated Actors for Riots (May-Aug 2020)', x=0.5),
font=dict(size=12),
width=800,
height=600
)
fig.show()
filtered_df = df[(df['year'] == 2020) &
(df['event_type'] == 'Protests')]
#Selecting a subset of only the Protests in 2020
fig = make_subplots(rows=2, cols=2, specs=[[{'type': 'domain'}, {'type': 'domain'}],
[{'type': 'domain'}, {'type': 'domain'}]],
subplot_titles=('May 2020', 'June 2020', 'July 2020', 'August 2020'))
#Creating subplot for easy visibility
months = ['May', 'June', 'July', 'August']
subplot_positions = [(1, 1), (1, 2), (2, 1), (2, 2)]
for month, position in zip(months, subplot_positions):
    month_df = filtered_df[filtered_df['event_date'].dt.strftime('%B') == month]
    # Many rows list several associated actors separated by ';'. Split them, give each
    # actor its own row, and trim stray whitespace so the value counts are accurate.
    assoc_actor_counts = month_df['assoc_actor_1'].str.split(';').explode().str.strip().value_counts()
    top_contributors = assoc_actor_counts.head(5)
    labels = top_contributors.index
    values = top_contributors.values
    fig.add_trace(go.Pie(labels=labels, values=values, name=month), row=position[0], col=position[1])
fig.update_layout(
title=dict(text='Top 5 Contributors in Associated Actors for Protests (May-Aug 2020)', x=0.5),
font=dict(size=12),
width=800,
height=600
)
fig.show()
filtered_df = df[(df['event_date'].dt.year == 2020) &
(df['event_date'].dt.month.isin([5, 6, 7, 8]))]
#Choosing the months when there was a peak in activity in 2020
notes = ' '.join(filtered_df['notes'].dropna())
wordcloud = WordCloud(width=800, height=400, background_color ='white').generate(notes)
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud of Notes (May-August 2020)')
plt.show()
df_filtered = df[df['event_date'].dt.year == 2021].copy()  # .copy() avoids SettingWithCopyWarning
df_filtered['year_month'] = df_filtered['event_date'].dt.strftime('%Y-%m')
monthly_protest_counts = df_filtered.groupby(['year_month', 'event_type']).size().to_frame(name='protest_count').reset_index()
# As for 2020, get the event count for each month in 2021, coloured by event_type
fig = px.bar(monthly_protest_counts, x='year_month', y='protest_count', color = 'event_type',
title='Monthly Protest Count in 2021',
labels={'year_month': 'Year-Month', 'protest_count': 'Number of Protests'})
fig.update_xaxes(tickangle=-45)
fig.show()
The code below explores January, March, May, and October, the months that showed peaks in activity in 2021.
df_filtered_riot = df_filtered[df_filtered['event_type'] == 'Riots']
df_filtered_protests = df_filtered[df_filtered['event_type'] == 'Protests']
df_filtered_riot_day = df_filtered_riot.groupby('event_date')['event_type'].count().reset_index(name='riot_count')
fig = px.bar(df_filtered_riot_day, x='event_date', y='riot_count', title='Daily Riot Count in 2021')
#Daily Riot count in the year 2021
fig.show()
df_filtered_protests_day = df_filtered_protests.groupby('event_date')['event_type'].count().reset_index(name='protest_count')
fig = px.bar(df_filtered_protests_day, x='event_date', y='protest_count', title='Daily Protest Count in 2021')
# Daily protest count in the year 2021
fig.show()
filtered_df = df[(df['year'] == 2021) &
(df['event_type'] == 'Protests')]
fig = make_subplots(rows=2, cols=2, specs=[[{'type': 'domain'}, {'type': 'domain'}],
[{'type': 'domain'}, {'type': 'domain'}]],
subplot_titles=('January 2021', 'March 2021', 'May 2021', 'October 2021'))
months = ['January', 'March', 'May', 'October']
subplot_positions = [(1, 1), (1, 2), (2, 1), (2, 2)]
for month, position in zip(months, subplot_positions):
    month_df = filtered_df[filtered_df['event_date'].dt.strftime('%B') == month]
    # As above: split the ';'-separated associated actors, trim whitespace, and count them
    assoc_actor_counts = month_df['assoc_actor_1'].str.split(';').explode().str.strip().value_counts()
    top_contributors = assoc_actor_counts.head(5)
    labels = top_contributors.index
    values = top_contributors.values
    fig.add_trace(go.Pie(labels=labels, values=values, name=month), row=position[0], col=position[1])
fig.update_layout(
title=dict(text='Top 5 Contributors in Associated Actors for Protests (January, March, May, Oct 2021)', x=0.5),
font=dict(size=12),
width=800,
height=600
)
fig.show()
# Select the months that showed a peak in activity in 2021
filtered_df = df[(df['event_date'].dt.year == 2021) &
                 (df['event_date'].dt.month.isin([1, 3, 5, 10]))]
# Combine the text of the notes into one string for the word cloud
notes = ' '.join(filtered_df['notes'].dropna())
wordcloud = WordCloud(width=800, height=400, background_color ='white').generate(notes)
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud of Notes (January, March, May, Oct 2021)')
plt.show()
df_filtered = df[df['event_date'].dt.year == 2022].copy()  # keep only 2022; .copy() avoids SettingWithCopyWarning
df_filtered['year_month'] = df_filtered['event_date'].dt.strftime('%Y-%m')  # year-month label for grouping
monthly_protest_counts = df_filtered.groupby(['year_month', 'event_type']).size().to_frame(name='protest_count').reset_index()
# Get the count of events per month in 2022, split by event_type
fig = px.bar(monthly_protest_counts, x='year_month', y='protest_count', color = 'event_type',
title='Monthly Protest Count in 2022',
labels={'year_month': 'Year-Month', 'protest_count': 'Number of Protests'})
fig.update_xaxes(tickangle=-45)
fig.show()
The code below explores January, March, May, and June, the peak months of activity in 2022.
df_filtered_riot = df_filtered[df_filtered['event_type'] == 'Riots']
df_filtered_protests = df_filtered[df_filtered['event_type'] == 'Protests']
sub_event_counts = df_filtered_protests['sub_event_type'].value_counts().reset_index()
sub_event_counts.columns = ['sub_event_type', 'count'] # Getting the Sub_event_types and their counts.
fig = px.pie(sub_event_counts, names="sub_event_type", values="count",hole = 0.3, title="Count of Protests Sub Event types")
fig.show()
# contribution of sub_event type for the year 2022.
df_filtered_riot_day = df_filtered_riot.groupby('event_date')['event_type'].count().reset_index(name='riot_count')
fig = px.bar(df_filtered_riot_day, x='event_date', y='riot_count', title='Daily Riot Count in 2022')
# Daily riot count in 2022, to see at what time of year the spikes occur
fig.show()
df_filtered_protests_day = df_filtered_protests.groupby('event_date')['event_type'].count().reset_index(name='protest_count')
fig = px.bar(df_filtered_protests_day, x='event_date', y='protest_count', title='Daily Protest Count in 2022')
# Daily protest count in 2022, to see at what time of year the spikes occur
fig.show()
filtered_df = df[(df['year'] == 2022) &
(df['event_type'] == 'Protests')]
# Keep only the 2022 protest events, to understand the associated actors at play.
fig = make_subplots(rows=2, cols=2, specs=[[{'type': 'domain'}, {'type': 'domain'}],
[{'type': 'domain'}, {'type': 'domain'}]],
subplot_titles=('January 2022', 'March 2022', 'May 2022', 'June 2022'))
months = ['January', 'March', 'May', 'June']
subplot_positions = [(1, 1), (1, 2), (2, 1), (2, 2)]
for month, position in zip(months, subplot_positions):
    # Handle one month at a time: split the ';'-separated associated actors,
    # trim whitespace, and count them to get the top 5 for each of the 4 peak months
    month_df = filtered_df[filtered_df['event_date'].dt.strftime('%B') == month]
    assoc_actor_counts = month_df['assoc_actor_1'].str.split(';').explode().str.strip().value_counts()
    top_contributors = assoc_actor_counts.head(5)
    labels = top_contributors.index
    values = top_contributors.values
    fig.add_trace(go.Pie(labels=labels, values=values, name=month), row=position[0], col=position[1])
fig.update_layout(
title=dict(text='Top 5 Contributors in Associated Actors for Protests (January, March, May, Jun 2022)', x=0.5),
font=dict(size=12),
width=800,
height=600
)
fig.show()
filtered_df = df[(df['event_date'].dt.year == 2022) &
(df['event_date'].dt.month.isin([1, 3, 5, 6]))]
# As before, generate a word cloud for the 4 peak months of 2022
notes = ' '.join(filtered_df['notes'].dropna())
wordcloud = WordCloud(width=800, height=400, background_color ='white').generate(notes)
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud of Notes (January, March, May, Jun 2022)')
plt.show()
Based on Actor 1, all rows for protests and riots outside the United States were removed.
As the pie chart shows, most of the data in the dataset comes from 2020. We can delve further into understanding:
From the previous plot we observed that the majority of the data comes from 2020. This bar chart corroborates that.
We can see that, overall, protests outnumber riots. By the dataset's definitions, violent outbreaks have therefore been less common. It will be interesting to see the breakdown for specific groups.
In line with the previous graph, most of the sub-events are peaceful protests, which also explains why protests make up the majority of events.
Compared to the United States, the information at our disposal on protests and riots outside the United States is limited. Because the data is so skewed, it might not give an accurate representation of international events. These rows should be dealt with before continuing the analysis.
Looking at the Top 50 gives a very broad and not very informative overview of the feature. Zooming in on the top 5 provides a useful snapshot, as they account for a large share of the associated actors. The Top 5 seem to be:
(BLM: Black Lives Matter, Students, Labour Group, Women, Government of the United States, African American Group)
As we saw for Actor1, the same holds here: most of the data relates to the United States, and very little to anywhere else. This can be dealt with in the same way.
By analyzing Actor1 and Assoc Actor 1 against Actor 2 and Assoc Actor 2, we can try to gain some interesting insights into their relationships.
At this stage this is fairly unimportant, as we do not have more information on the codes and their meaning, and more than 90% of rows fall under interaction code 60. We can leave it out of the analysis.
Here we can see that most of the states in question are states with large cities and large populations. Since most of the population centers and the modern working public in the United States are located here, it makes sense that the top 10 would roughly match the 10 most populous and prominent states in the United States.
The majority of the information comes from three sources:
The heatmap provides a count of BLM-related protests across different categories of protest intensity:
This visualization shows BLM protest counts by month and state, highlighting:
The word cloud indicates frequently occurring words in relation to BLM protests, giving insight into the context and issues:
The bar chart detailing the count of BLM-related protests by media sources suggests:
The map displays the geographical spread of BLM-related protests across the United States:
The bar chart represents the count of student-related protests as reported by different news sources or platforms. Upon closer examination, we can observe:
Media Engagement: The leading source, Red & Black, has the highest count, which suggests that this source is particularly engaged in student affairs or located in a region with high protest activity. If this source is affiliated with a university or a student body, it may imply that it is a primary outlet for student voices and issues.
Variability Among Outlets: There is a distribution of counts among sources, with some like MLive, Fight Back News, and Indybay following closely behind. This could indicate regional reporting tendencies or different outlets focusing on different aspects of student activism.
Platform Types: The presence of Twitter suggests that social media is also a significant platform for reporting or discussing student protests. It highlights the importance of social media in mobilizing, organizing, or spreading awareness about student protests.
By examining the source of the reports, stakeholders can better understand where to look for information about student protests and how different media perceptions might influence public understanding.
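One caveat when counting by source: as the sample rows at the top of the notebook show, a single `source` cell can list several outlets separated by `;` (for example `CBS News; Liveuamap`). A minimal sketch of counting individual outlets rather than combined strings is given below. The toy rows are illustrative only and are not taken from the dataset.

```python
import pandas as pd

# Toy rows standing in for the real 'source' column (values are illustrative).
toy = pd.DataFrame({
    "source": [
        "Red & Black; Twitter",
        "MLive",
        "Twitter; MLive",
        None,
    ]
})

# Split on ';', give each outlet its own row, and trim surrounding whitespace.
outlets = (
    toy["source"]
    .dropna()
    .str.split(";")
    .explode()
    .str.strip()
)
outlet_counts = outlets.value_counts()
print(outlet_counts)
```

Counting this way credits every outlet named in a row, so an event reported by two outlets contributes to both counts.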
The geographical distribution map plots the locations of student protests across the United States. This map reveals several key patterns:
Density in Urban and Coastal Areas: There's a clear density of protests in urban areas and along the coasts, particularly the East and West Coasts. These areas typically host large universities and colleges with significant student populations.
Hotspots of Activism: Certain states like California, New York, and others along the coastlines show high concentrations of protests. This could correlate with political activism, student population size, and social issues prevalent in these states.
Midwest and Rural Representation: While the coastal areas are more densely populated with protest activity, there's still a fair representation in the Midwest and some rural areas. The nature of these protests may differ and could be influenced by local or state issues pertinent to the student bodies in these regions.
Regional Issues: The map indicates that student protests are not isolated to any one area, suggesting a variety of issues motivate these protests, possibly reflecting regional concerns like local government policies, state educational reforms, or region-specific social issues.
Access to Education and Diversity: The widespread distribution also suggests that access to higher education and the diversity of student bodies are nationwide phenomena, and thus, the potential for student activism is not limited by geography.
Influence of Institutions: Institutions located in these hotspots likely play a significant role in nurturing a culture of activism. The map could be cross-referenced with locations of major universities to understand the influence of institutional culture on protest activity.
Infrastructure for Protest: Urban areas, especially those with higher education institutions, may provide the necessary infrastructure—such as public spaces and organizational networks—that facilitates protests.
The heatmap presents a count of protests by students from 2020 to 2022 categorized by the type of event. Key observations are:
This stacked bar chart displays the monthly count of student protests across various states. Insights include:
The word cloud visualizes the most frequently occurring words in the context of student protests. The largest words offer the most salient insights:
The bar chart showcases the number of protests reported by various news sources. This indicates:
The map illustrates the geographical spread of student protests across the United States. Observations include:
The heatmap illustrates the count of different types of labor-related protests from 2020 to 2022. Key insights are:
This visualization breaks down the monthly protest counts across different states, and several trends can be identified:
The word cloud for "Labour Group" provides insights into the most frequent terms associated with these protests:
This bar chart measures the number of protests reported by different sources or scales:
The geographical map shows the spread of labor group protests across the United States:
The most frequent actor pair is "(BLM: Black Lives Matter, Pro-Police Group (United States))" with a frequency of 173. This suggests a significant number of interactions or co-occurrences between the Black Lives Matter movement and pro-police groups in the United States.
The second most frequent pair is "(Pro-Police Group (United States), BLM: Black Lives Matter)" with a frequency of 105. This pair is the same as the top pair in reverse order, indicating that the dataset captures interactions in both directions.
Other prominent pairs include various combinations of Black Lives Matter, Back the Blue, Civilians, Proud Boys, Antifa, Students, Women, Christian Groups, and Journalists. This diversity of actor pairs suggests a complex network of interactions among different social, political, and ideological groups.
The presence of "Back the Blue" and "Pro-Police Group" in multiple pairs highlights the significant role of police-related groups in the protest landscape.
The inclusion of "Students," "Women," and "Christian Group" in the top pairs indicates the involvement of these demographic and religious groups in the protest events.
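Because the same two actors can appear in either order, it may be useful to count pairs without regard to direction. The sketch below shows one way to do that by sorting each pair before counting. The pairs listed are illustrative placeholders, not rows from the dataset.

```python
from collections import Counter

# Illustrative (actor_a, actor_b) pairs; not taken from the dataset.
pairs = [
    ("BLM: Black Lives Matter", "Pro-Police Group (United States)"),
    ("Pro-Police Group (United States)", "BLM: Black Lives Matter"),
    ("BLM: Black Lives Matter", "Back the Blue"),
]

# Sorting each pair makes (A, B) and (B, A) map to the same key.
unordered_counts = Counter(tuple(sorted(pair)) for pair in pairs)
top_pair, top_count = unordered_counts.most_common(1)[0]
print(top_pair, top_count)
```

With directed counts kept separately, the two orderings can still be compared; the unordered count simply adds them together.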
Top 20 Most Frequent Actor Pairs
The chart clearly shows the dominance of pairs involving the Black Lives Matter movement and pro-police groups, with "BLM: Black Lives Matter & Pro-Police Group (United States)" being the most frequent pair by a significant margin.
The chart also highlights the frequent co-occurrence of "Back the Blue" and "BLM: Black Lives Matter," suggesting a strong interaction between these two groups.
Pairs involving "Proud Boys," "Antifa," "Students," and "Women" are also prominent, indicating their active involvement in the protest events.
The presence of "Government of the United States (2017-2021)" in the top pairs suggests the involvement of the government as an actor in the protest landscape.
The chart also includes pairs with "Journalists," "Sole Perpetrator," "Police Forces of the United States," and "Unidentified Communal Militia," highlighting the diversity of actors involved in the protests.
By checking the frequency of the interaction codes other than 60, which covers sole protesters, we can understand the associations between actors at the actor level. Combining these with the insights gathered previously can reveal some interesting information.
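As a rough sketch of that check, the snippet below drops code 60 and counts what remains. It assumes the codes live in a column named `interaction`, which is not visible in the truncated preview at the top of the notebook, and the toy values are made up for illustration.

```python
import pandas as pd

# Toy interaction codes; 'interaction' as the column name is an assumption.
toy = pd.DataFrame({"interaction": [60, 60, 60, 16, 56, 16, 66]})

# Share of rows with code 60, then the frequency of every other code.
share_60 = (toy["interaction"] == 60).mean()
non_sole = toy.loc[toy["interaction"] != 60, "interaction"]
code_counts = non_sole.value_counts()
print(f"share of code 60: {share_60:.2f}")
print(code_counts)
```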
This heatmap presents the count of various government-related protests from 2020 to 2022, across different levels of confrontation:
This bar chart breaks down the protest counts by month and state, giving a view into the temporal and geographical distribution of these events:
The word cloud highlights frequently occurring terms related to government-related protests:
This bar chart depicts the count of government-related protest reports by different news sources and scales:
The geographical map shows the spread of government-related protests across the country:
High-Density Areas: The map shows high-density clusters of protests in major urban areas on the East and West Coasts, particularly around cities like Los Angeles, New York, and Washington D.C. This is consistent with population density and the presence of institutional structures that are often the focus of demonstrations.
BLM Protests: The BLM protests are widespread but show significant concentrations in urban areas, reflecting the national conversation on racial justice and the reaction to incidents of police brutality. Their presence across the map indicates the extensive reach of the movement.
Student Protests: The distribution of student protests correlates with locations of major universities and colleges, with a noticeable presence in college towns across the country. The pattern suggests a youth-led movement deeply rooted in educational institutions.
Labour Group Protests: These protests are spread out but appear more concentrated in areas with a history of labor union activity and industrial centers. They are not as densely clustered as BLM protests, indicating more localized issues driving labor demonstrations.
Women's Protests: The protests related to women’s issues, which are likely to involve reproductive rights and gender equality, also appear in both urban and rural areas but are particularly prominent in regions where such issues are highly legislated or debated.
Government Protests: These are broadly distributed, highlighting the omnipresence of civic engagement and the propensity of citizens to assemble in response to political actions, policy changes, and governmental decisions.
Intensity: BLM protests, given their genesis in response to acute and highly publicized incidents, may carry an intensity reflected in the clusters on the map. Where BLM protests are dense, it might indicate a strong community response to racial justice issues.
Interconnected Movements: There's potential overlap in areas where multiple colors converge on the map. For example, student and BLM protests may intersect on issues of campus policing, while labor and government protests might converge over public sector working conditions.
Seasonal and Temporal Factors: Certain areas may show cyclical patterns of protests related to specific times of the year, such as Women's History Month or Labor Day, which are not immediately apparent on the map but important for nuanced understanding.
Socio-Political Context: The overlay of different protest types indicates that some regions are hotbeds for multiple forms of activism, possibly reflecting a vibrant socio-political culture that encourages public demonstration across a spectrum of issues.
Broader Social Movements: The coalescence of various protest types suggests broader social movements and coalitions forming around shared goals or in opposition to common challenges.
Public Policy Response: Dense areas of protest activity might be indicative of regions where public policy is not aligned with the populace's demands or where political tension is most palpable.
Resource Allocation for Civic Management: Law enforcement and public safety officials could use this map to allocate resources more effectively, ensuring peaceful protests can occur without undue escalation.
Media Coverage and Public Perception: The map also has implications for media coverage, as journalists might focus on areas with high protest density, potentially skewing public perception of the ubiquity or intensity of these issues.
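The overlap mentioned under Interconnected Movements can also be checked numerically instead of by eye. The sketch below counts how many distinct actor groups appear at each location, assuming a combined frame with `location` and `assoc_actor` columns like the one built for the combined map. The rows here are toy values, not results from the dataset.

```python
import pandas as pd

# Toy stand-in for the combined frame (location + assoc_actor label per event).
toy = pd.DataFrame({
    "location": ["New York", "New York", "Eureka", "Monterey", "Monterey"],
    "assoc_actor": ["BLM", "Students", "Women", "Labour Group", "Labour Group"],
})

# Number of distinct actor groups seen at each location.
groups_per_location = toy.groupby("location")["assoc_actor"].nunique()

# Locations where more than one group was active.
overlap_locations = groups_per_location[groups_per_location > 1].index.tolist()
print(overlap_locations)
```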
This bar chart tracks the number of protests (blue) and riots (red) per month throughout 2020. The most prominent feature is the large peak in June, which likely corresponds to events triggered by a significant occurrence, possibly the George Floyd incident leading to worldwide Black Lives Matter protests. There are smaller, yet notable, amounts of protests and riots in the other months, suggesting a year of continued civil unrest. The presence of red in the bars indicates that some protests each month escalated into riots, with June having the highest count of such events.
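To put a number on how much of each month's activity was riots, the monthly counts can be pivoted so that protests and riots sit side by side. This is a minimal sketch on made-up rows; the `riot_share` column is a hypothetical helper introduced here, not part of the dataset.

```python
import pandas as pd

# Made-up events for illustration only.
toy = pd.DataFrame({
    "event_date": pd.to_datetime([
        "2020-05-30", "2020-05-31", "2020-06-01",
        "2020-06-06", "2020-06-06", "2020-07-04",
    ]),
    "event_type": ["Riots", "Riots", "Protests", "Protests", "Riots", "Protests"],
})
toy["year_month"] = toy["event_date"].dt.strftime("%Y-%m")

# One row per month, one column per event type.
monthly = pd.crosstab(toy["year_month"], toy["event_type"])

# Fraction of each month's events that were riots (hypothetical helper column).
monthly["riot_share"] = monthly["Riots"] / (monthly["Protests"] + monthly["Riots"])
print(monthly)
```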
This bar chart shows the number of riots each day in 2020. The most striking feature is the spike on May 31st, indicating a day of particularly intense rioting, potentially aligned with the peak observed in the monthly count. The rest of the year shows a baseline level of rioting activity with some smaller spikes, suggesting intermittent periods of escalated tension.
Similar to the riot chart, this bar chart details daily protests. The peak on June 6th aligns with the surge in the monthly data and confirms a period of heightened protest activity, possibly in response to the same event that triggered the riots. The frequency and intensity of protests seem to decline as the year progresses but remain present, indicating ongoing societal issues or responses to events.
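The peak days can be read off the daily counts directly rather than from the chart. The sketch below finds the busiest day per event type; the rows are toy values chosen for illustration and do not reproduce the dataset's counts.

```python
import pandas as pd

# Toy events; dates and counts are illustrative only.
toy = pd.DataFrame({
    "event_date": pd.to_datetime([
        "2020-05-31", "2020-05-31", "2020-06-01",
        "2020-06-06", "2020-06-06", "2020-06-06", "2020-06-07",
    ]),
    "event_type": ["Riots", "Riots", "Riots",
                   "Protests", "Protests", "Protests", "Protests"],
})

# Events per day for each event type.
daily = (
    toy.groupby(["event_type", "event_date"])
    .size()
    .rename("count")
    .reset_index()
)

# Row with the highest daily count within each event type.
peak_rows = daily.loc[daily.groupby("event_type")["count"].idxmax()]
peak = peak_rows.set_index("event_type")["event_date"]
print(peak_rows)
```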
This series of pie charts shows the proportion of different groups involved in riots from May to August 2020. In May, a vast majority is attributed to BLM, with a much smaller percentage to other groups. This dominance decreases slightly over the following months but remains the majority. This data suggests that the BLM movement was the main driver of riots during this period, likely in response to incidents of police brutality and racial injustice.
This set of pie charts categorizes the primary contributors to protests during the same period. The BLM movement again dominates in May and June but decreases over July and August. Notably, in August, the proportions are more evenly spread among different actors, suggesting diversification of protest motivations over time.
The word cloud visually emphasizes the most frequently used words in notes related to the events. Words like "protest," "police," "movement," "George Floyd," and "BLM" are prominent, indicating their significance in the narrative of 2020's protests and riots. The size of the words "pandemic" and "coronavirus" reflects the backdrop of COVID-19 during these events.
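Word size in the cloud tracks how often a term occurs, so a plain frequency table gives the same information in a form that is easier to compare. The sketch below is a simplified approximation using a regular expression and a tiny hand-written stopword set; it does not reproduce the `WordCloud` library's own tokenization, and the notes are invented examples.

```python
import re
from collections import Counter

# Invented notes for illustration; not taken from the dataset.
notes = [
    "On 31 May 2020, protesters marched against police brutality.",
    "Police dispersed protesters in the city.",
]

# A tiny, hand-written stopword set (an assumption, far smaller than a real one).
stopwords = {"on", "the", "in", "against", "a", "of"}

# Lowercase, keep alphabetic tokens only, and drop stopwords.
tokens = [
    word
    for note in notes
    for word in re.findall(r"[a-z]+", note.lower())
    if word not in stopwords
]
word_counts = Counter(tokens)
print(word_counts.most_common(5))
```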
The data illustrates a year of significant civil unrest, with the BLM movement being a major contributor to protests and riots, particularly in the middle of the year. The impact of the George Floyd incident is seen with spikes in activity and the prevalence of related terms in communications. The persistent level of activity throughout the year suggests a broad range of issues sparking public action, not limited to but heavily influenced by racial tensions and the response to police actions.
Insights drawn from these visualizations could inform understanding of civil movement patterns, governmental and societal responses, and the impact of specific events on national stability. These analyses also underscore the influence of broader contexts such as the pandemic on civil unrest dynamics.
Monthly Protest Count in 2021:
Daily Riot and Protest Count in 2021:
Top 5 Contributors in Associated Actors for Protests:
Word Cloud of Notes:
Monthly Protest Count in 2022: The bar chart depicts the number of protests each month, with a notable peak in May. This suggests a significant event or series of events triggered a higher level of civil activity or unrest during that month. It's interesting to note that while there's a substantial number of protests, the accompanying portion of riots is relatively low, indicating that most of these protests did not devolve into riots.
Count of Protests Sub Event types: The donut chart shows the distribution of protest sub-event types, overwhelmingly dominated by peaceful protests. A very small fraction represents protests with interventions and even fewer instances of excessive force against protesters. This suggests that while there were many protests, they were largely non-violent, and only a small fraction faced intervention, with even fewer resulting in excessive force.
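The proportions behind the donut chart can be computed directly with normalized value counts. This is a minimal sketch on toy labels; the 80/10/10 split is invented for illustration and is not the dataset's actual distribution.

```python
import pandas as pd

# Toy sub-event labels; the proportions are invented.
toy = pd.Series(
    ["Peaceful protest"] * 8
    + ["Protest with intervention", "Excessive force against protesters"],
    name="sub_event_type",
)

# normalize=True returns fractions of the total instead of raw counts.
shares = toy.value_counts(normalize=True)
print(shares.round(2))
```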
Word Cloud of Notes: The word cloud visualizes the most frequent terms in notes related to protests. Prominent terms like "Roe," "Wade," "Supreme Court," and "gun control" suggest that judicial decisions and legislation are key motivators for protests. The appearance of "Ukraine" and "Russia" could indicate international issues also played a significant role in civil activities.
Top 5 Contributors in Associated Actors for Protests: The pie charts for January, March, May, and June 2022 show the proportion of actors associated with protests. In May, there is a large section labeled "Women (United States)," which could correlate with the peak in protests that month, perhaps due to abortion-related events or strikes. "Government of the United States (2021-)" and "Students (United States)" are consistently present.
Daily Riot Count in 2022: This visualization shows the number of riots occurring each day. There's variability, but on most days, the number is low. There are a few spikes.
Daily Protest Count in 2022: The data indicates more frequent protest activity than riots, with significant spikes that could align with major events or announcements, particularly noticeable in late June.